Compositions and Methods for Analyzing Modified Nucleotides

ABSTRACT

A method for identifying any of the presence, location and phasing of modified cytosines (C) in long stretches of nucleic acids is provided. In some embodiments, the method may comprise (a) reacting a first portion of a nucleic acid sample containing at least one C and/or at least one modified C with a DNA glucosyltransferase and a cytidine deaminase to produce a first product and/or reacting a second portion of the sample with a dioxygenase, optionally a DNA glucosyltransferase, and a cytidine deaminase to produce a second product; and (b) comparing the sequences from the first and optionally the second product obtained in (a), or amplification products thereof, with each other and/or an untreated reference sequence to determine which Cs in the initial nucleic acid fragment are modified. A modified TET methylcytosine dioxygenase with improved efficiency compared to unmodified TET2 at converting methylcytosine to carboxymethylcytosine is also provided.

CROSS REFERENCE

This application is a divisional of U.S. application Ser. No. 15/441,431 filed Feb. 24, 2017, which is a continuation-in-part of International Application No. PCT/US16/59447 filed Oct. 28, 2016, which claims the benefit of U.S. Provisional Application Nos. 62/248,872, filed Oct. 30, 2015, 62/257,284, filed Nov. 19, 2015 and 62/271,679, filed Dec. 28, 2015, which applications are incorporated by reference herein.

BACKGROUND

The ability to phase modified nucleotides (e.g., methylated or hydroxymethylated nucleotides) in a genome (i.e., determine whether two or more modified nucleotides are linked on the same single DNA molecule or on different DNA molecules) can provide important information in epigenetic studies, particularly for studies on imprinting, gene regulation, and cancer. In addition, it would be useful to know which modified nucleotides are linked to sequence variations.

Modified nucleotides cannot be phased using conventional methods for investigating DNA modification because such methods typically involve bisulfite sequencing (BS-seq). In BS-seq methods, a DNA sample is treated with sodium bisulfite, which converts cytosines (C) to uracil (U), but methylcytosine (^(m)C) remains unchanged. When bisulfite-treated DNA is sequenced, unmethylated C is read as thymine (T), and ^(m)C is read as C, yielding single-nucleotide resolution information about the methylation status of a segment of DNA.

However, sodium bisulfite is known to fragment DNA (see, e.g., Ehrich M 2007 Nucl. Acids Res. 35:e29), making it impossible to determine whether modified nucleotides are linked on the same DNA molecule. Specifically, it is impossible for nucleotide modifications to be phased in the same way that sequence variants (e.g., polymorphisms) are phased because those methods require intact, long molecules.

Moreover, bisulfite sequencing displays a bias toward cytosine (C) adjacent to certain nucleotides and not others. It would be desirable to remove the observed bias.

SUMMARY

Provided herein are methods for phasing modified nucleotides that do not require bisulfite treatment.

Further, such methods can be implemented in a way that distinguishes between ^(m)C and hydroxymethylcytosine (^(hm)C) or C, formylcytosine (^(f)C) and carboxylcytosine (^(ca)C), providing significant advantages over conventional methods.

This disclosure provides, among other things, compositions and methods to detect and phase methylation and/or hydroxymethylation of nucleotides or unmodified nucleotides in cis or trans at a single molecule level in long stretches of DNA. In various embodiments, glucosylation and oxidation reactions overcome the observed inherent deamination of ^(hm)C and ^(m)C by deaminases. Deaminases convert ^(m)C to T and C to U while glucosylhydroxymethylcytosine (^(ghm)C) and ^(ca)C are not deaminated. Examples of deaminases include APOBEC (apolipoprotein B mRNA editing enzyme, catalytic polypeptide-like). Embodiments utilize enzymes that have substantially no sequence bias in glycosylation, oxidation and deamination of cytosine. Moreover, embodiments provide substantially no non-specific damage of the DNA during the glycosylation, oxidation and deamination reactions.

In some embodiments, a DNA glucosyltransferase (GT), for example beta glucosyltransferase (BGT), is utilized for glucosylating ^(hm)C to protect this modified base from deamination. However, a person of ordinary skill in the art will appreciate that other enzymatic or chemical reactions may be used for modifying the ^(hm)C to achieve the same effect. One alternative example provided herein is the use of Pyrrolo-dC for protecting cytosine from being converted to uracil by cytidine deaminase.

In general, in one aspect, methods for detecting nucleic acid (NA) methylation are provided that include subjecting the NA to enzymatic glucosylation, enzymatic oxidation and enzymatic deamination, where an unmodified C is converted to a U, ^(m)C is converted to T, an ^(hm)C that is glucosylated (^(ghm)C) remains C, and a modified C that is oxidized to ^(ca)C remains C. The majority of modified C are predicted to be ^(m)C. For some diagnostic purposes, differentiating between ^(m)C and ^(hm)C is not required. Accordingly, it is sufficient to utilize a single pathway of oxidation and glucosylation followed by deamination. Where it is desirable to distinguish ^(m)C from ^(hm)C, this can be achieved by performing two different reactions on two aliquots of the same sample and subsequently comparing the sequences of the DNA obtained. One reaction utilizes a GT and a cytidine deaminase while a second reaction utilizes a methylcytosine dioxygenase and a cytidine deaminase. It has been found here that the presence of GT in a reaction with a methylcytosine dioxygenase results in an outcome which shows an improved conversion rate (greater than 97%, 98% or 99% conversion, preferably at least 99%) of modified bases and more accurate mapping than would otherwise be possible. Methylcytosine dioxygenase variants are described herein which catalyze the conversion of the ^(m)C to ^(hm)C to ^(f)C and then ^(ca)C with little or no bias caused by neighboring nucleotides. These and other improved properties of such variants are also described herein. Methods using enzymes described herein utilizing phasing or other sequencing methods are more time and sample efficient and provide improved accuracy for diagnostic sequencing of ^(m)C and other modified nucleotides.
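The conversion outcomes stated above can be summarized as a lookup table. The following Python sketch is illustrative only: the labels `mC`, `hmC`, `ghmC` and `caC`, the dictionary names and the function `read_out` are hypothetical, and the sketch assumes complete conversion and that any U is read as T after amplification.

```python
# Illustrative sketch only; names and labels are hypothetical, not part of the disclosure.
# Base found in each reaction product, assuming complete enzymatic conversion.
GT_DEAMINASE = {"C": "U", "mC": "T", "hmC": "ghmC"}    # GT protects hmC as ghmC
TET_DEAMINASE = {"C": "U", "mC": "caC", "hmC": "caC"}  # dioxygenase oxidizes mC and hmC to caC

# How each product base appears in a sequence read (U is read as T after amplification).
READ_AS = {"U": "T", "T": "T", "ghmC": "C", "caC": "C"}


def read_out(base, pathway):
    """Return the sequenced base for an input base; other bases pass through unchanged."""
    product = pathway.get(base, base)
    return READ_AS.get(product, product)


for base in ("C", "mC", "hmC"):
    print(base, read_out(base, GT_DEAMINASE), read_out(base, TET_DEAMINASE))
```

Under these assumptions, an unmodified C reads as T in both products, ^(m)C reads as T in the GT product and as C in the dioxygenase product, and ^(hm)C reads as C in both.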

In each of these methods, it is desirable to compare the product of the enzyme reactions with each other and/or an unreacted sequence. Comparing sequences can be achieved by hybridization techniques and/or by sequencing. Prior to comparing sequences, it may be desirable to amplify the NA using PCR or isothermal methods and/or clone the reacted sequence.

The NA fragments being analyzed may be DNA, RNA or a hybrid or chimera of DNA and RNA. The NA fragments may be single stranded (ss) or double stranded (ds). The NA fragments may be genomic DNA or synthetic DNA.

The size of the fragments may be any size, but for embodiments of the present invention that utilize single molecule sequencing, fragment sizes that are particularly advantageous are greater than 1 kb, 2 kb, 3 kb, 4 kb, 5 kb, 6 kb, 7 kb or larger (for example, preferably greater than 4 kb) with no theoretical limitation on the upper size, although the upper size of the fragment may be limited by the polymerase in the amplification step commonly used prior to sequencing if amplification is needed.

In some cases, the sequences obtained from the reactions are compared with a corresponding reference sequence to determine: (i) which Cs are converted into a U in the first product for differentiating a ^(m)C from a ^(hm)C; and (ii) which Cs are converted to a U for differentiating an unmodified C from a modified C in the optional second product. In these embodiments, the reference sequence may be a hypothetical deaminated sequence, a hypothetical deaminated and PCR amplified sequence or a hypothetical non-deaminated sequence, for example.
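The comparison with a reference sequence can be sketched in a few lines of Python. This is a minimal illustration, not the analysis pipeline of the disclosure: the function name `call_cytosines` is hypothetical, and the sketch assumes that both reads are already aligned to the same strand of a non-deaminated reference without gaps, that conversion is complete, and that U has been read as T after amplification.

```python
# Minimal sketch under stated assumptions; the function name is hypothetical.
def call_cytosines(reference, gt_read, tet_read):
    """Classify each reference C from two gap-free, same-strand reads.

    gt_read:  read from the GT + cytidine deaminase product (first product)
    tet_read: read from the dioxygenase + cytidine deaminase product (second product)
    """
    if not (len(reference) == len(gt_read) == len(tet_read)):
        raise ValueError("sequences must be aligned to the same length")
    calls = {}
    for position, ref_base in enumerate(reference):
        if ref_base != "C":
            continue
        gt_base, tet_base = gt_read[position], tet_read[position]
        if gt_base == "C" and tet_base == "C":
            calls[position] = "hmC"        # protected in both reactions
        elif gt_base == "T" and tet_base == "C":
            calls[position] = "mC"         # protected only after oxidation
        elif gt_base == "T" and tet_base == "T":
            calls[position] = "C"          # deaminated in both reactions
        else:
            calls[position] = "ambiguous"  # e.g., sequencing error or incomplete conversion
    return calls


# Made-up example sequences for illustration only.
print(call_cytosines("ACGCTCA", "ATGTTCA", "ATGCTCA"))
```

Positions that do not fit any expected pattern are reported as ambiguous rather than forced into a call.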

In any embodiment, the first and second products may be amplified prior to sequencing. In these embodiments, any U's in the first and second products may be read as T's in the resultant sequence reads.

In any embodiment, the methylcytosine dioxygenase may convert ^(m)C and ^(hm)C to ^(ca)C so that cytidine deaminase cannot deaminate ^(m)C or ^(hm)C. The methylcytosine dioxygenase may be a TET protein that enzymatically converts modified Cs to ^(ca)C.

In any embodiment, the GT may be a DNA β-glucosyltransferase (βGT) or α-glucosyltransferase (αGT) that forms ^(ghm)C so that substantially no ^(hm)C is deaminated by the cytidine deaminase.

In any embodiment, the NA sample may contain at least one CpG island. In another embodiment, the NA may include at least two modified Cs with nucleotide neighbors selected from CpG, CpA, CpT and CpC.

In any embodiment, the method may comprise determining the location of the ^(m)C and/or ^(hm)C on a ss of the NA where the NA is ds.

In any embodiment, the NA is a fragment of genomic DNA and, in some cases, the NA may be linked to a transcribed gene (e.g., within 50 kb, within 20 kb, within 10 kb, within 5 kb or within 1 kb of a transcribed gene).

The method summarized above may be employed in a variety of applications. A method for sample analysis is provided. In some embodiments, this method may comprise one or more of the following steps: (a) determining the location of all modified Cs in a test NA fragment to identify a pattern for the modified C; (b) comparing the pattern of C modifications in the test NA fragment with the pattern of C modifications in a reference NA; (c) identifying a difference in the pattern of cytosine modifications in the test NA fragment relative to the reference NA fragment; and (d) determining a pattern of ^(hm)C in the test NA fragment.

In some embodiments, this method may comprise comparing the pattern of C modification or unmodified C for a NA fragment that is linked, in cis, to a gene in a transcriptionally active state to the pattern of C modifications in the same intact NA fragment that is linked, in cis, to the same gene in a transcriptionally inactive state. In these embodiments, the level of transcription of the gene may be correlated with a disease or condition.

In some embodiments, this method may comprise comparing the pattern of cytosine modification for a NA fragment from a patient that has a disease or condition with the pattern of C modification in the same NA fragment from a patient that does not have the disease or condition. In other embodiments, the method may comprise comparing the pattern of cytosine modification for a NA fragment from a patient that is undergoing a treatment with the pattern of C modification in the same intact NA fragment from a patient that has not been treated with the agent. In another embodiment, detected differences in the pattern of C modification in the test NA fragment relative to the reference NA fragment correspond to a variant single nucleotide polymorphism, an insertion/deletion or a somatic mutation associated with a pathology.

A variety of compositions are also provided. In some embodiments, the composition may comprise a NA, wherein the NA comprises: a) G, A, T, U, C; b) G, A, T, U, ^(ca)C and no C; and/or c) G, A, T, U and ^(ghm)C and no C; and/or d) G, A, T, U, ^(ca)C and ^(ghm)C and no C. In some embodiments, the composition may further comprise a cytidine deaminase or mutant thereof (as described in U.S. Pat. No. 9,121,061), or a methylcytosine dioxygenase or mutant thereof as described below.

A kit is also provided. In some embodiments, the kit may comprise a GT, a methylcytosine dioxygenase, e.g., a mutant methylcytosine dioxygenase (TETv as described below), and a cytidine deaminase, as well as instructions for use. As would be apparent, the various components of the kit may be in separate vessels.

In general, in one aspect, a protein is described that includes an amino acid sequence that is at least 90% identical to SEQ ID NO:1; and contains SEQ ID NO:2. In one aspect, the protein is a fusion protein that includes an N-terminal affinity binding domain. The protein may have methylcytosine dioxygenase activity where the methylcytosine dioxygenase activity is similarly effective for NCA, NCT, NCG and NCC in a target DNA. The protein may be employed in any method herein.

In any embodiment, the protein may be a fusion protein. In these embodiments, the variant protein may comprise an N-terminal affinity binding domain.

Also provided by this disclosure is a method for modifying a naturally occurring DNA containing one or more methylated C. In some embodiments, this method may comprise combining a sample comprising the DNA with a variant methylcytosine dioxygenase to make a reaction mix; and incubating the reaction mix to oxidize the methylated cytosine in the DNA.

In some embodiments, the method may further comprise analyzing the oxidized sample, e.g., by sequencing or mass spectrometry.

In some embodiments, the reaction mix may further comprise a GT.

In some embodiments, the method may be done in vitro, in a cell-free reaction.

In some embodiments, the method may be done in vivo, e.g., in cultured cells.

The above-summarized variant methylcytosine dioxygenase can be used as a methylcytosine dioxygenase in any of the methods, compositions or kits described below.

In general, in one aspect, a method is provided for determining the location of modified cytosines in a nucleic acid fragment that includes: (a) reacting a nucleic acid sample containing at least one C and/or at least one modified C with a methylcytosine dioxygenase and a DNA glucosyltransferase in a single buffer either together or sequentially; (b) reacting the product of (a) with a cytidine deaminase; and (c) comparing the sequences obtained in (a), or amplification products thereof, with an untreated reference sequence to determine which Cs in the initial nucleic acid fragment are modified. In one aspect, the methylcytosine dioxygenase is an amino acid sequence that is at least 90% identical to SEQ ID NO:1; and contains the amino acid sequence of SEQ ID NO:2.

BRIEF DESCRIPTION OF THE FIGURES

The patent or application file contains at least one drawing executed in color. Copies of this patent or patent application publication with color drawings will be provided by the office upon request and payment of the necessary fee.

Certain aspects of the following detailed description are best understood when read in conjunction with the accompanying drawings. It is emphasized that, according to common practice, the various features of the drawings are not to scale. On the contrary, the dimensions of the various features are arbitrarily expanded or reduced for clarity. Included in the drawings are the following figures:

FIG. 1A shows a schematic diagram of a method for protecting modified Cs from deamination by a cytidine deaminase using a ^(m)C dioxygenase, for example a TET enzyme such as TETv, that converts ^(m)C and ^(hm)C (not C) to ^(ca)C, which is insensitive to deamination. After methylcytosine dioxygenase treatment, only unmodified C is deaminated, resulting in its replacement by U. From left to right: SEQ ID NO:20, SEQ ID NO:20, SEQ ID NO:21.

FIG. 1B shows a second method for protecting ^(hm)C but not ^(m)C from deamination by APOBEC enzyme. Here ^(hm)C is glucosylated using a βGT, for example T4-βGT, or an αGT, for example T4-αGT. C and ^(m)C are converted by a cytidine deaminase (e.g., APOBEC) to a U and a T, respectively. From left to right: SEQ ID NO:20, SEQ ID NO:20, SEQ ID NO:22.

FIG. 1C is a table showing readouts of bases of a genomic sample after PCR amplification and Sanger sequencing or NGS sequencing.

FIG. 2A-2B shows the methylation and hydroxymethylation status of mouse genomic DNA.

FIG. 2A shows the distribution of ^(m)C and ^(hm)C at a single locus (locus size: 1078 bp) of mouse fibroblast NIH/3T3 genomic DNA following methylcytosine dioxygenase (here TETv) and cytidine deaminase treatment (according to FIG. 1A).

FIG. 2B shows the distribution of ^(hm)C at the same locus as FIG. 2A after GT (here βGT) and cytidine deaminase treatment (according to FIG. 1B).

FIG. 2C is a summary of LC-MS data of methylation status of a locus in genomic DNAs of mouse fibroblasts.

FIG. 3A-3E shows that ss DNA is not damaged during preparation and analysis using TETv and/or βGT and cytidine deaminase in contrast to methods that use conventional bisulfite treatment (for bisulfite method see, for example, Holmes, et al. PLoS One 9, no. 4 (2014): e93933).

FIG. 3A shows results obtained with βGT and cytidine deaminase. Six different fragment sizes (388 bp, 731 bp, 1456 bp, 2018 bp, 3325 bp, and 4229 bp) were analyzed after treatment with a cytidine deaminase and βGT. Full length fragments in each size category were amplified. No fragmentation was observed.

FIG. 3B shows results obtained with TETv and cytidine deaminase. Six different fragment sizes (388 bp, 731 bp, 1456 bp, 2018 bp, 3325 bp, and 4229 bp) were analyzed after treatment with a cytidine deaminase and TETv. Full length fragments in each size category were amplified. No fragmentation was observed.

FIG. 3C shows results obtained with bisulfite converted DNA. Six different fragment sizes (388 bp, 731 bp, 1456 bp, 2018 bp, 3325 bp, and 4229 bp) were analyzed after bisulfite treatment. Full length fragments in each size category were amplified. When bisulfite converted DNA was amplified, only the two smallest fragments were obtained because of the breakdown of the larger fragments by the bisulfite method.

FIG. 3D shows results obtained with the primers for a 5030 bp amplicon and a 5378 bp amplicon after treating DNA before amplification with T4-βGT (^(hm)C detection) or TETv (^(m)C+^(hm)C detection), and cytidine deaminase (see FIGS. 1A and 1B). Each amplification is shown in triplicate. No fragmentation was observed.

FIG. 3E shows that a 15 kb fragment of ss DNA containing ^(m)C/^(hm)C is not damaged during preparation and analysis using TETv/βGT/cytidine deaminase enzymes in contrast to methods that use conventional bisulfite treatment. The light blue line represents the denatured ss DNA of the 15 kb fragment, which is also the control. The red line is APOBEC deamination on glucosylated DNA. The dark blue line is DNA deamination on TETv oxidized DNA. The green line is bisulfite treated DNA.

FIGS. 4A and 4B show that cytidine deaminase does not deaminate the modified base Pyrrolo-dC (Glen Research, Sterling, Va.). This modified base can be used in Illumina NGS library construction to protect C in the adapters ligated to the ends of DNA fragments in the library from deamination prior to cytidine deaminase treatment.

FIG. 4A shows the results of treating oligonucleotide (5′-ATAAGAATAGAATGAATXGTGAAATGAATATGAAATGAATAGTA-3′, X=Pyrrolo-dC, SEQ ID NO:4) with cytidine deaminase at 37° C. for 16 hours (upper line (black)). The control (lower line (grey)) is untreated SEQ ID NO:4. No difference was observed between the sample and the control, confirming that cytidine deaminase does not deaminate Pyrrolo-dC.

FIG. 4B shows a chromatogram (LC-MS) of an adaptor containing Pyrrolo-dC, with the following sequence, where X=Pyrrolo-dC: 5′/5Phos/GATXGGAAGAGXAXAXGTXTGAAXTXXAGTX/deoxyU/AXAXTXTTTXXXTAXAXGAXGXTXTTXXGATCT (SEQ ID NO:5). The LC-MS chromatogram confirms that all C's are replaced by Pyrrolo-dC, with no trace of contaminating Cs.

FIG. 5 shows that the method described in Example 4, which provides sequences from next generation sequencing (NGS) using an Illumina platform as an example of Deaminase-seq, provides superior conversion efficiency compared with BS-seq. Unmethylated lambda DNA was used as a negative control to estimate the non-conversion error rate (methylated C calls/total C calls). The CD^(m)C reaction (left slashes) has the smallest error rate of 0.1% for both CpG and CH (H=A, C, T) contexts. Bisulfite conversion using the Zymo kit (right slashes) has a 3 times higher error rate (0.4%) than the method shown in FIGS. 1A and 1B, and bisulfite conversion by Qiagen (white) has an even higher error rate of 1.6% for the CpG context and 1.5% for the CH context.
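The non-conversion error rate defined above (methylated C calls/total C calls on an unmethylated control) is a simple ratio. The Python sketch below illustrates the arithmetic only; the function name and the call counts are hypothetical and are not data from the figure.

```python
# Illustrative sketch; the function name and the example counts are hypothetical.
def non_conversion_error_rate(methylated_c_calls, total_c_calls):
    """Fraction of C positions in an unmethylated control that were called methylated."""
    if total_c_calls <= 0:
        raise ValueError("total_c_calls must be positive")
    if not 0 <= methylated_c_calls <= total_c_calls:
        raise ValueError("methylated_c_calls must be between 0 and total_c_calls")
    return methylated_c_calls / total_c_calls


# Made-up counts: 12 false methylation calls among 10,000 C calls.
rate = non_conversion_error_rate(12, 10_000)
print(f"{rate:.2%}")
```

Because the control is unmethylated, every methylated call counts as an error, so the ratio directly estimates the fraction of Cs that escaped conversion or were otherwise miscalled.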

FIG. 6A-6D shows that Deaminase-seq displays no systematic sequence preference while BS-seq generates a significant amount of conversion errors, most notably in a CA context. Pie charts depict the numbers and percentages of false positive methylation calls in each C dinucleotide context in the unmethylated lambda genome by different methods.

FIG. 6A shows a pie chart of wild type lambda genome as a control with the naturally occurring distribution of CT, CA, CG and CC.

FIG. 6B shows the representation of ^(m)C in a lambda genome where every C has been methylated using Deaminase-seq. The observed distribution matches that found in FIG. 6A.

FIG. 6C shows the representation of ^(m)C in a lambda genome where every C has been methylated using BS-seq (Qiagen). The observed distribution is not consistent with that found in FIG. 6A.

FIG. 6D shows the representation of ^(m)C in a lambda genome where every C has been methylated using BS-seq (Zymo). The observed distribution is not consistent with that found in FIG. 6A.

FIG. 7 shows that Deaminase-seq (Illumina) covered more CpG sites and detected more methylated CpG sites than both BS-seq libraries using the same library analysis and the same number of sequencing reads, demonstrating that Deaminase-seq is a more efficient and cost effective method than BS-seq.

FIG. 8A-8C shows that Deaminase-seq provides an even genome-wide sequence coverage in the mouse genome from Illumina generated reads of overlapping fragments. Three histograms of CpG coverage are shown where the 3 methods have the same mean (5×) and median (4×) sequencing depth for CpG sites. However, Deaminase-seq has fewer outliers (sites with very low or very high copy numbers) when compared with BS-seq kits from Zymo and Qiagen. Three data sets are shown, each normalized for library size.

FIG. 8A shows the distribution of reads for DNA Deaminase-seq.

FIG. 8B shows the distribution of reads for BS-seq (Qiagen).

FIG. 8C shows the distribution of reads for BS-seq (Zymo).

FIG. 9 shows that Deaminase-seq provides higher coverage in CpG islands than BS-seq. For the same number of sequencing reads, Deaminase-seq gives nearly 2 times as much coverage as BS-seq in the CpG islands.

FIG. 10 provides a locus-specific map of ^(hm)C on a genomic fragment from mouse chromosome 8. Deaminase-seq (FIGS. 1A and 1B) accurately detects ^(hm)C in large fragments (5 kb) at base resolution, enabling phasing of DNA modifications with one another and with other genomic features such as SNPs or variants.

FIG. 11A-11B shows a ^(m)C and ^(hm)C profile at single-molecule level across the 5.4 kb region generated by PacBio sequencing. Each row represents one DNA molecule. Each CpG site in the 5.4 kb region was represented by a dot. C modification states were denoted by color.

FIG. 11A shows that the present method can be used to phase ^(m)C(red=methylated; blue=unmethylated).

FIG. 11B shows that the present method can be used to phase ^(hm)C(red=hydroxymethylated and blue=unmodified).

FIG. 12A shows an activity comparison of mouse TET2 catalytic domain (TETcd; SEQ ID NO:3) with TETv (SEQ ID NO:1) on sheared 3T3 genomic DNA.

FIG. 12B shows that the activity of TETv on ss and ds genomic (3T3) DNA is similar.

FIG. 13 shows that TETv exhibits very low sequence bias and is context independent for ^(m)C as demonstrated for 5 cell lines (Arabidopsis, rice, M.Fnu4FI, E14 and Jurkat).

FIG. 14 shows that TETv does not degrade DNA as determined from the preservation of supercoiled DNA after enzyme treatment. Lane 1 is a size ladder; Lane 2 is substrate plasmid only; Lane 3 is supercoiled plasmid+323 pmol of TETv; Lane 4 is supercoiled plasmid+162 pmol TETv; Lane 5 is supercoiled plasmid+162 pmol TETv; Lane 6 is substrate plasmid+323 pmol TETv+BamHI+MspI; Lane 7 is substrate plasmid+162 pmol TETv+BamHI+MspI; and Lane 8 is substrate plasmid+BamHI+MspI.

FIG. 15 shows that APOBEC3A can substantially completely deaminate bothC and 5mC.

FIG. 16 shows that the low sequence bias of Deaminase-Seq includes accurate representation of cytosine in cytosine rich fragments such as CpG islands. Cytosines in CpG islands are substantially depleted using bisulfite sequencing.

FIG. 17 shows that the lack of fragmentation using Deaminase-Seq correlates with a low nucleic acid starting concentration for detecting the position of modified bases in the nucleic acid. For example, 1 ng of a genomic DNA library is sufficient for detecting or mapping normal and modified cytosine.

FIG. 18 shows a second example of methylome phasing (also see FIG. 10 and FIG. 11A-11B) using embodiments of the methods described herein, here the results of methylome phasing of an imprinted gene using Deaminase-Seq (SMRT® sequencing (Pacific Biosciences, Menlo Park, Calif.)). The region of imprinting identified by bisulfite sequencing is relatively short while a region of greater than twice the length is identified using Deaminase-Seq. Each red dot on the sequence map corresponds to a modified cytosine.

DEFINITIONS

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. Singleton, et al., DICTIONARY OF MICROBIOLOGY AND MOLECULAR BIOLOGY, 2D ED., John Wiley and Sons, New York (1994), and Hale & Markham, THE HARPER COLLINS DICTIONARY OF BIOLOGY, Harper Perennial, N.Y. (1991) provide one of skill with the general meaning of many of the terms used herein. Still, certain terms are defined below for the sake of clarity and ease of reference.

As used herein, the term “buffering agent” refers to an agent that allows a solution to resist changes in pH when acid or alkali is added to the solution. Examples of suitable non-naturally occurring buffering agents that may be used in the compositions, kits, and methods of the invention include, for example, Tris, HEPES, TAPS, MOPS, tricine, or MES.

The term “non-naturally occurring” refers to a composition that does not exist in nature.

Any protein described herein may be non-naturally occurring, where the term “non-naturally occurring” refers to a protein that has an amino acid sequence and/or a post-translational modification pattern that is different to the protein in its natural state. For example, a non-naturally occurring protein may have one or more amino acid substitutions, deletions or insertions at the N-terminus, the C-terminus and/or between the N- and C-termini of the protein. A “non-naturally occurring” protein may have an amino acid sequence that is different to a naturally occurring amino acid sequence (i.e., having less than 100% sequence identity to the amino acid sequence of a naturally occurring protein) but that is at least 80%, at least 85%, at least 90%, at least 95%, at least 97%, at least 98% or at least 99% identical to the naturally occurring amino acid sequence. In certain cases, a non-naturally occurring protein may contain an N-terminal methionine or may lack one or more post-translational modifications (e.g., glycosylation, phosphorylation, etc.) if it is produced by a different (e.g., bacterial) cell. A “mutant” protein may have one or more amino acid substitutions relative to a wild-type protein and a “fusion” protein may have one or more exogenous domains added to the N-terminus, C-terminus, and/or the middle portion of the protein.

In the context of a nucleic acid (NA), the term “non-naturally occurring” refers to a NA that contains: a) a sequence of nucleotides that is different to a NA in its natural state (i.e., having less than 100% sequence identity to a naturally occurring NA sequence), b) one or more non-naturally occurring nucleotide monomers (which may result in a non-natural backbone or sugar that is not G, A, T or C) and/or c) one or more other modifications (e.g., an added label or other moiety) to the 5′-end, the 3′-end, and/or between the 5′- and 3′-ends of the NA.

In the context of a composition, the term “non-naturally occurring” refers to: a) a combination of components that are not combined by nature, e.g., because they are at different locations, in different cells or different cell compartments; b) a combination of components that have relative concentrations that are not found in nature; c) a combination that lacks something that is usually associated with one of the components in nature; d) a combination that is in a form that is not found in nature, e.g., dried, freeze dried, crystalline, aqueous; and/or e) a combination that contains a component that is not found in nature. For example, a preparation may contain a “non-naturally occurring” buffering agent (e.g., Tris, HEPES, TAPS, MOPS, tricine or MES), a detergent, a dye, a reaction enhancer or inhibitor, an oxidizing agent, a reducing agent, a solvent or a preservative that is not found in nature.

As used herein, the term “composition” refers to a combination of reagents that may contain other reagents, e.g., glycerol, salt, dNTPs, etc., in addition to those listed. A composition may be in any form, e.g., aqueous or lyophilized, and may be at any state (e.g., frozen or in liquid form).

As used herein, the term “location” refers to the position of a nucleotide in an identified strand in a NA molecule.

As used herein, the term “phasing” refers to a determination of whether two or more nucleotides of a given status (i.e., whether the nucleotides are modified or not, for example, whether nucleotides such as C are methylated, hydroxymethylated, formyl modified, carboxylated or unmodified) are on the same molecule of NA (a single DNA molecule or allele), on different homologous chromosomes from a single cell, or on homologous chromosomes from different cells in a sample, noting that in different cells or different tissues, homologous chromosomes may have a different epigenetic status.
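As a rough illustration of phasing at the single-molecule level, the Python sketch below counts how often two cytosine positions are modified on the same molecule. The function name `phase_pair`, the read identifiers and the positions are hypothetical, and the sketch assumes that each long read has already been reduced to the set of positions called as modified.

```python
# Illustrative sketch; names, read identifiers and positions are hypothetical.
from collections import Counter


def phase_pair(molecules, pos_a, pos_b):
    """Count molecules by (modified at pos_a, modified at pos_b).

    molecules maps a read identifier to the set of positions called modified
    on that single molecule.
    """
    counts = Counter()
    for modified_positions in molecules.values():
        counts[(pos_a in modified_positions, pos_b in modified_positions)] += 1
    return counts


# Made-up single-molecule calls for four reads spanning both positions.
molecules = {
    "read1": {10, 42},
    "read2": {10, 42},
    "read3": set(),
    "read4": {10},
}
counts = phase_pair(molecules, 10, 42)
print(counts[(True, True)], "molecules carry both modifications")
```

A high count for the (True, True) and (False, False) classes relative to the mixed classes would suggest that the two modifications travel together on the same molecules, which is the kind of linkage that fragmented short reads cannot reveal.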

As used herein, the term “nucleic acid” (NA) refers to a DNA, RNA, DNA/RNA chimera or hybrid that may be ss or ds and may be genomic or derived from the genome of a eukaryotic or prokaryotic cell, or synthetic, cloned, amplified, or reverse transcribed. In certain embodiments of the methods and compositions, NA preferably refers to genomic DNA as the context requires.

As used herein, the term “modified cytosine” refers to a cytosine that is methylated (methylcytosine, ^(m)C), hydroxymethylated (hydroxymethylcytosine, ^(hm)C), formyl modified, carboxy modified or modified by any other chemical group that may be found naturally associated with C.

As used herein, the term “methylcytosine dioxygenase” refers to an enzyme that converts ^(m)C to ^(hm)C. TET1 (Jin, et al., Nucleic Acids Res. 2014 42: 6956-71) is an example of a methylcytosine dioxygenase, although many others are known including TET2, TET3 and Naegleria TET (Pais et al., Proc. Natl. Acad. Sci. 2015 112: 4316-4321). Examples of methylcytosine dioxygenases, which may be referred to as “oxygenase”, are provided in U.S. Pat. No. 9,121,061. TETv is an example of a methylcytosine dioxygenase that oxidizes at least 90%, 92%, 94%, 96%, or 98% of all modified C.

As used herein, the term “cytidine deaminase” refers to an enzyme that is capable of deaminating C to form a U. Many cytidine deaminases are known. For example, the APOBEC family of cytidine deaminases is described in U.S. Pat. No. 9,121,061. APOBEC3A (Stenglein, Nature Structural & Molecular Biology 2010 17: 222-229) is an example of a deaminase. In any embodiment, the deaminase used may have an amino acid sequence that is at least 90% identical to (e.g., at least 95% identical to) the amino acid sequence of GenBank accession number AKE33285.1, which is the human APOBEC3A. Preferably, the cytidine deaminase converts unmodified cytosine to uracil with an efficiency of at least 90%, 92%, 94%, 96% or 98%, preferably at least 96%.

As used herein, the term “DNA glucosyltransferase (GT)” refers to an enzyme that catalyzes the transfer of a β- or α-D-glucosyl residue from UDP-glucose to a ^(hm)C residue in DNA. An example of a GT is T4-βGT. In one example, the use of GT follows a dioxygenase reaction and ensures that deamination of ^(hm)C is blocked so that less than 10% or 7% or 5% or 3% (preferably less than 3%) of ^(hm)C is converted to U by the deaminase.

The term “substantially” refers to greater than 50%, 60%, 70%, 80%, or more particularly 90% of the whole.

As used herein, the term “comparing” refers to analyzing two or more sequences relative to one another. In some cases, comparing may be done by aligning two or more sequences with one another such that correspondingly positioned nucleotides are aligned with one another.

As used herein, the term “reference sequence” refers to the sequence of a fragment that is being analyzed. A reference sequence may be obtained from a public database or it may be separately sequenced as part of an experiment. In some cases, the reference sequence may be “hypothetical” in the sense that it may be computationally deaminated (i.e., to change C's into U's or T's etc.) to allow a sequence comparison to be made. As used herein, the terms “G”, “A”, “T”, “U”, “C”, “^(m)C”, “^(ca)C”, “^(hm)C” and “^(ghm)C” refer to nucleotides that contain guanine (G), adenine (A), thymine (T), uracil (U), cytosine (C), ^(m)C, ^(ca)C, ^(hm)C and ^(ghm)C, respectively. For clarity, C, ^(ca)C, ^(m)C and ^(ghm)C are different moieties.
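The “hypothetical”, computationally deaminated reference mentioned above can be illustrated with a minimal Python sketch. The function name `deaminate_reference` and the `protected` argument are assumptions introduced here for illustration only; the sketch handles a single strand and treats every unprotected C as fully converted.

```python
def deaminate_reference(seq: str, protected: frozenset = frozenset()) -> str:
    """Return seq with every C changed to T, except at protected 0-based positions.

    'protected' is a hypothetical way to mark Cs assumed to resist deamination
    (for example, modified Cs); it is not part of the described method.
    """
    return "".join(
        "T" if base == "C" and i not in protected else base
        for i, base in enumerate(seq.upper())
    )


reference = "ACGTCCGA"
fully_converted = deaminate_reference(reference)
partially_converted = deaminate_reference(reference, frozenset({1}))
```

Comparing a sequenced product against `fully_converted` rather than against the untreated reference means that any remaining C in the read stands out as a candidate modified position.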

As used herein, the term “no C”, in the context of a NA fragment that contains no C, refers to a NA fragment that contains no C. Such a NA may contain ^(ca)C, ^(m)C and/or ^(ghm)C and nucleotides other than C.

The term “internal” refers to a location within the polypeptide that is within a region that extends up to amino acids from either end of the polypeptide.

The term “repeat” refers to a plurality of amino acids that are repeated within the polypeptide.

The term “fusion” refers to a protein having one or more exogenous binding domains added to the N-terminus, C-terminus, and/or the middle portion of the protein. The binding domain is capable of recognizing and binding to another molecule. Thus, in some embodiments the binding domain is a histidine tag (“His-tag”), a maltose-binding protein, a chitin-binding domain, a SNAP-Tag® (New England Biolabs, Ipswich, Mass.) or a DNA-binding domain, which may include a zinc finger and/or a transcription activator-like (TAL) effector domain.

As used herein, “N-terminal portion of the protein” refers to amino acids within the first 50% of the protein. As used herein, “C-terminal portion of the protein” refers to the terminal 50% of the protein.

The term “Next Generation Sequencing (NGS)” generally applies to sequencing libraries of genomic fragments of a size of less than 1 kb, preferably using an Illumina sequencing platform. In contrast, single molecule sequencing is performed using a platform from Pacific Biosciences, Oxford Nanopore, or 10× Genomics or any other platform known in the art that is capable of sequencing molecules of length greater than 1 kb or 2 kb.

DETAILED DESCRIPTION OF THE EMBODIMENTS

Before the various embodiments are described, it is to be understood that the teachings of this disclosure are not limited to the particular embodiments described, and as such can, of course, vary. It is also to be understood that the terminology used herein is for the purpose of describing particular embodiments only, and is not intended to be limiting, since the scope of the present teachings will be limited only by the appended claims.

The section headings used herein are for organizational purposes only and are not to be construed as limiting the subject matter described in any way. While the present teachings are described in conjunction with various embodiments, it is not intended that the present teachings be limited to such embodiments. On the contrary, the present teachings encompass various alternatives, modifications, and equivalents, as will be appreciated by those of skill in the art.

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs. Although any methods and materials similar or equivalent to those described herein can also be used in the practice or testing of the present teachings, some exemplary methods and materials are now described.

The citation of any publication is for its disclosure prior to the filing date and should not be construed as an admission that the present claims are not entitled to antedate such publication by virtue of prior invention. Further, the dates of publication provided can be different from the actual publication dates, which may need to be independently confirmed.

As will be apparent to those of skill in the art upon reading this disclosure, each of the individual embodiments described and illustrated herein has discrete components and features which can be readily separated from or combined with the features of any of the other several embodiments without departing from the scope or spirit of the present teachings. Any recited method can be carried out in the order of events recited or in any other order which is logically possible.

All patents and publications, including all sequences disclosed within such patents and publications, referred to herein are expressly incorporated by reference.

Almost all studies on C modification in eukaryotic genomes have ignored the fact that eukaryotic genomes carry two or more copies of each chromosome. Thus, most traditional studies on C modification do not provide any information about linkage between modified C. For example, methylation studies have traditionally been done using sodium bisulfite, which converts C into U. However, as shown below, sodium bisulfite also fragments DNA, thereby making it difficult, if not impossible, to determine whether two nearby modified C are linked on the same DNA molecule or unlinked on different molecules. The method described herein provides a solution to this problem.

In some embodiments, the sequencing may be done in a way that allows one to determine the identity and location of unmodified or modified C, as well as whether those unmodified or modified C are linked on the same molecule (i.e., “phased”). For example, in some embodiments, the method may comprise reacting a first portion of a sample that contains relatively long, intact NA fragments (e.g., at least 1 kb, at least 5 kb, at least 10 kb, at least 50 kb, up to 100 kb or 200 kb or more in length) with a GT and a cytidine deaminase to produce a first product. This product differentiates C and ^(m)C from ^(hm)C as shown in FIG. 1B. A second portion of the sample may be reacted with a methylcytosine dioxygenase (and optionally a GT) as shown in FIG. 1A. The methylcytosine dioxygenase and the GT may be combined in the same reaction mix or used sequentially in the same or different buffers. This reaction is followed by a cytidine deaminase reaction to distinguish between unmodified C and modified C. Depending on the sequence of the initial fragment (e.g., whether the initial fragment in FIG. 1B contains G, A, T, C, ^(m)C and, in some cases, ^(hm)C), the first product may contain G, A, T, U, no C and ^(ghm)C (if the initial fragment contained ^(hm)C). In FIG. 1A, the second product alone may contain G, A, T, U, ^(ca)C and no C. These enzymes and methods avoid degradation of the NA substrate and provide improved phasing of modified nucleotides over long pieces of the genome that are not degraded by the enzymes. These enzymes and methods achieve sequencing and mapping of modified nucleotides with minimal bias and improved efficiency.

After the first and optionally second products are produced, they may be amplified and/or cloned, and then sequenced using a suitable sequencing method. This may include single molecule sequencing for phased sequencing. Phased sequencing may be done in a variety of different ways. In some embodiments, the products may be sequenced using a long read single-molecule sequencing approach such as Nanopore sequencing (e.g., as described in Soni, et al Clin Chem 53:1996-2001 2007, and developed by Oxford Nanopore Technologies) or Pacific Biosciences' fluorescent base-cleavage method (which currently have an average read length of over 10 kb, with some reads over 60 kb). Alternatively, the products may be sequenced using the methods of Moleculo (Illumina, San Diego, Calif.), 10× Genomics (Pleasanton, Calif.), or NanoString Technologies (Seattle, Wash.). In these methods, the sample is optionally diluted and then partitioned into a number of partitions (wells of a microtitre plate or droplets in an emulsion, etc.) in an amount that limits the probability that any partition contains two molecules of the same locus (e.g., two molecules containing the same gene). Next, these methods involve producing indexed amplicons of a size that is compatible with the sequencing platform being used (e.g., amplicons in the range of 200 bp to 1 kb in length) where amplicons derived from the same partition are barcoded with the same index unique to the partition. Finally, the indexed amplicons are sequenced, and the sequence of the original, long molecules can be reconstituted using the index sequences. Phased sequencing may also be done using barcoded transposons (see, e.g., Adey Genome Res. 2014 24: 2041-9 and Amini Nat Genet. 2014 46: 1343-9), and by using the “reflex” system of Population Genetics Technologies (Casbon, Nucleic Acids Res. 2013 41:e112).
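The use of partition indexes to regroup short amplicons into their original long molecules can be sketched as follows. This is a simplified illustration only: the tuple layout `(index, start, sequence)` and the function name `group_by_partition` are assumptions made here, and real platforms apply their own assembly and error-correction steps.

```python
from collections import defaultdict


def group_by_partition(amplicons):
    """Group (index, start, sequence) tuples by partition index.

    Amplicons that share an index are assumed to derive from the same
    partition, and therefore from the same original long molecule.
    Returns {index: [(start, sequence), ...]} ordered by start coordinate.
    """
    groups = defaultdict(list)
    for index, start, sequence in amplicons:
        groups[index].append((start, sequence))
    return {index: sorted(frags) for index, frags in groups.items()}


# Hypothetical indexed amplicons from two partitions.
amplicons = [
    ("AAC", 200, "TTGA"),
    ("GGT", 0, "ATGT"),
    ("AAC", 0, "ACGT"),
]
by_partition = group_by_partition(amplicons)
```

Because all amplicons under one index are attributed to one molecule, modified positions called on them can be reported as linked, which is the basis of phasing in these partition-based approaches.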

Alternatively, the genome may be fragmented into fragments of less than 1 kb in size to form a library for next-generation sequencing. Pyrrolo-dC modified adaptors may be added to the fragments in the library prior to enzyme treatment according to FIGS. 1A-1B and Example 1. After the enzyme reaction, the adaptor ligated libraries may be sequenced using an Illumina sequencer. After the sequences of the first and optionally the second product are obtained, the sequences are compared with a reference sequence to determine which C's in the initial NA fragment are modified. A matrix illustrating an embodiment of this part of the method is shown in FIG. 1C. In some embodiments, this comparing may be done by comparing the sequences obtained from the first product of the sample (i.e., the methylcytosine dioxygenase (and optionally GT) and cytidine deaminase treated portion of the sample) and the untreated sample and/or second product of the sample (i.e., the GT and cytidine deaminase treated portion of the sample) with a corresponding reference sequence (untreated and/or the first product). Possible outcomes include:

i. The position of a C in the initial NA fragment is identified by a U in both the first and second products;
ii. The position of a ^(m)C in the initial NA fragment is determined by the presence of a C in the first product or a T in the second product; and
iii. The position of a ^(hm)C in the initial NA fragment is determined by the presence of a C in the second product only.

It should be noted that should there be no need to differentiate the ^(m)C from the rarer ^(hm)C, then this information can be obtained from the second product only (FIG. 1A).

As would be understood, if the product is cloned, amplified or sequenced by a polymerase, a “U” will be read as “T”. In these embodiments, nucleotides read as a T in both the first and second products still indicate Cs that have been changed to Us in the initial deamination reaction.
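The comparison logic can be sketched in Python. To avoid ambiguity over which product is called “first” or “second”, the two reads are named here by treatment: `gt_read` is the GT plus cytidine deaminase product, and `tet_read` is the methylcytosine dioxygenase (plus optional GT) plus cytidine deaminase product. The sketch assumes polymerase-derived reads (U read as T), top-strand reads already aligned to a reference of equal length, complete conversion, and that the deaminase converts unprotected ^(m)C to T; the function name and labels are illustrative assumptions.

```python
def call_cytosines(reference, gt_read, tet_read):
    """Classify each reference C from two treated, aligned, same-length reads.

    gt_read:  GT + deaminase product (only glucosylated hmC stays C).
    tet_read: dioxygenase (+ GT) + deaminase product (mC and hmC stay C).
    Returns {0-based position: "C" | "mC" | "hmC" | "ambiguous"}.
    """
    calls = {}
    for pos, ref_base in enumerate(reference):
        if ref_base != "C":
            continue
        pair = (gt_read[pos], tet_read[pos])
        if pair == ("T", "T"):
            calls[pos] = "C"      # deaminated in both treatments
        elif pair == ("T", "C"):
            calls[pos] = "mC"     # protected only after oxidation
        elif pair == ("C", "C"):
            calls[pos] = "hmC"    # protected in both treatments
        else:
            calls[pos] = "ambiguous"
    return calls


calls = call_cytosines("ACGCTCA", "ATGTTCA", "ATGCTCA")
```

When the calls are made on a single long read per treatment, every position in `calls` belongs to the same molecule, so the pattern of modified and unmodified C is inherently phased.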

As would be recognized, some of the analysis steps of the method, e.g., the comparing step, can be implemented on a computer. In certain embodiments, a general-purpose computer can be configured to a functional arrangement for the methods and programs disclosed herein. The hardware architecture of such a computer is well known by a person skilled in the art, and can comprise hardware components including one or more processors (CPU), a random-access memory (RAM), a read-only memory (ROM), and an internal or external data storage medium (e.g., hard disk drive). A computer system can also comprise one or more graphic boards for processing and outputting graphical information to display means. The above components can be suitably interconnected via a bus inside the computer. The computer can further comprise suitable interfaces for communicating with general-purpose external components such as a monitor, keyboard, mouse, network, etc. In some embodiments, the computer can be capable of parallel processing or can be part of a network configured for parallel or distributive computing to increase the processing power for the present methods and programs. In some embodiments, the program code read out from the storage medium can be written into memory provided in an expanded board inserted in the computer, or an expanded unit connected to the computer, and a CPU or the like provided in the expanded board or expanded unit can actually perform a part or all of the operations according to the instructions of the program code, so as to accomplish the functions described below. In other embodiments, the method can be performed using a cloud computing system. In these embodiments, the data files and the programming can be exported to a cloud computer that runs the program and returns an output to the user.

A system can, in certain embodiments, comprise a computer that includes: a) a central processing unit; b) a main non-volatile storage drive, which can include one or more hard drives, for storing software and data, where the storage drive is controlled by a disk controller; c) a system memory, e.g., high speed random-access memory (RAM), for storing system control programs, data, and application programs, including programs and data loaded from the non-volatile storage drive; system memory can also include read-only memory (ROM); d) a user interface, including one or more input or output devices, such as a mouse, a keypad, and a display; e) an optional network interface card for connecting to any wired or wireless communication network, e.g., a printer; and f) an internal bus for interconnecting the aforementioned elements of the system.

The method described above can be employed to analyze genomic DNA from virtually any organism, including, but not limited to, plants, animals (e.g., reptiles, mammals, insects, worms, fish, etc.), tissue samples, bacteria, fungi (e.g., yeast), phage, viruses, cadaveric tissue, archaeological/ancient samples, etc. In certain embodiments, the genomic DNA used in the method may be derived from a mammal, where in certain embodiments the mammal is a human. In exemplary embodiments, the genomic sample may contain genomic DNA from a mammalian cell, such as a human, mouse, rat, or monkey cell. The sample may be made from cultured cells, formalin fixed samples or cells of a clinical sample, e.g., a tissue biopsy (for example from a cancer), scrape or lavage or cells of a forensic sample (i.e., cells of a sample collected at a crime scene). In particular embodiments, the NA sample may be obtained from a biological sample such as cells, tissues, bodily fluids, and stool. Bodily fluids of interest include but are not limited to, blood, serum, plasma, saliva, mucous, phlegm, cerebral spinal fluid, pleural fluid, tears, lactal duct fluid, lymph, sputum, cerebrospinal fluid, synovial fluid, urine, amniotic fluid, and semen. In particular embodiments, a sample may be obtained from a subject, e.g., a human. In some embodiments, the sample analyzed may be a sample of cell-free DNA obtained from blood, e.g., from the blood of a pregnant female.

In some embodiments of the invention, an enzymatic method has been provided which permits the sequencing of short and long NA (for example, ss DNA and ds DNA) to discover modified bases and to determine the phasing of such bases in the genome. Embodiments of the method may include a composition comprising a mixture of one or two enzymes where the one or two enzymes are selected from a methylcytosine dioxygenase and a GT and where the cytidine deaminase is added in a subsequent reaction. The dioxygenase and GT may be stored in the same or different buffers and combined as desired in a storage buffer or in a reaction mixture. When added separately to a reaction mixture, the addition may be sequential or the enzymes may be added together at the start of the reaction. Embodiments of the method may utilize two or more enzymes selected from a cytidine deaminase, a methylcytosine dioxygenase and a GT. Embodiments of the method may include a methylcytosine dioxygenase and a cytidine deaminase used sequentially in a reaction mixture; a methylcytosine dioxygenase and a GT used sequentially or together, preferably followed by a deaminase reaction; or a methylcytosine dioxygenase, GT and cytidine deaminase used sequentially or together.

In some embodiments that utilize a GT, UDP may be added to the reaction mixture.

In one embodiment, the methylcytosine dioxygenase and optionally the GT may be added to ds DNA in an initial step and then removed by a proteinase treatment, heat treatment and/or separation treatment. This may be followed by a cytidine deaminase reaction with separation and isolation of the deaminated DNA. In some embodiments, the pH of the cytidine deaminase reaction mixture is in the range of pH 5.5-8.5, for example pH 6.0-8.0, for example pH 6.0, pH 6.3, pH 6.5, pH 6.8, pH 7.0, pH 7.5, or pH 8.0, wherein the specific activity of the cytidine deaminase is increased at the lower end of the pH range such as at pH 6.0.

In one embodiment, concentration ranges of enzymes utilized in the reaction described for 1 μg DNA include: 0.001-100 micrograms of a methylcytosine dioxygenase such as the Ngo TET (Pais, supra), TET1, TET2 or TET3 or mutants thereof; 0.001-100 micrograms cytidine deaminase such as APOBEC or Deaminase; 0.001-100 units GT such as T4-βGT or T4-αGT. When pyrrolo-dC is used in adaptor synthesis, a standard procedure described in Example 4 is followed. The amount of UDP used follows the recommendation of the manufacturer.

The ss DNA product of the enzyme reaction or reactions can be amplified by PCR or an isothermal method such as ligase mediated amplification (LMA), helicase dependent amplification (HDA), rolling circle amplification (RCA), loop mediated amplification (LAMP), multiple displacement amplification (MDA), transcription mediated amplification (TMA), strand displacement amplification (SDA), or nicking enzyme amplification reaction (NEAR).

The amplified or indeed non-amplified DNA may be sequenced using any of the sequencing platforms in development or commercially available such as provided by Illumina, Oxford Nanopore, or Pacific Biosciences, or methods in development or commercially available such as Sanger sequencing or any WGS (whole genome sequencing) method. Long reads are mapped to the genome using the appropriate algorithm, for example, Bismark (see for example, Krueger et al. Bioinformatics 27, no. 11 (2011): 1571-1572). The methylation status is called when each read is mapped to the targeted region (for example, enhancer and promoter region).
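Calling methylation status per read within a targeted region can be sketched as below. This is a hedged illustration, not the output format of Bismark or any other mapper: each read is represented by a hypothetical dictionary mapping genomic position to `True` (modified) or `False` (unmodified), and the region is a half-open coordinate interval chosen for the example.

```python
def read_methylation(calls, region_start, region_end):
    """Count modified and total called sites of one read inside a region.

    calls: {genomic position: True if modified, False if unmodified}.
    The region is half-open: region_start <= position < region_end.
    Returns (number of modified sites, number of called sites).
    """
    in_region = [
        modified
        for position, modified in calls.items()
        if region_start <= position < region_end
    ]
    return sum(in_region), len(in_region)


# Hypothetical per-read calls and a hypothetical promoter interval.
read_a = {105: True, 140: True, 320: False}
read_b = {105: False, 140: False, 320: True}
promoter = (100, 200)
status_a = read_methylation(read_a, *promoter)
status_b = read_methylation(read_b, *promoter)
```

Keeping the counts per read, rather than pooling them per site, preserves the molecule-level information needed to ask whether modifications in the region occur together.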

Present embodiments provide many advantages over existing systems that result from factors that include: a lower error rate in identifying ^(m)C regardless of adjacent nucleotides, and a lower error rate in detecting low level methylations; no systematic sequence preference; more consistent genome wide sequencing coverage; higher coverage in C rich regions and CpG islands; covering more CpG sites where these may be distributed widely in the genome portion being analyzed; and accurate detection of ^(hm)C of large fragments (5 kb) at a base resolution enabling phasing of DNA modifications and phasing DNA modifications together with other genomic features such as SNPs or variants.

In some embodiments, the composition may comprise a NA that is made up of nucleotides G, A, T, U, ^(ca)C, wherein the NA contains substantially no C. In some embodiments, the composition may comprise a NA that is made up of nucleotides G, A, T, U and ^(ghm)C, wherein the NA contains substantially no C. In either embodiment, the composition may also contain a cytidine deaminase (e.g., a cytidine deaminase that is at least 90% identical to an APOBEC cytidine deaminase) and, in certain embodiments, may also contain a buffering agent and other components (e.g., NaCl) in amounts that are compatible with cytidine deaminase activity. The composition may be an aqueous composition.

Variant ^(m)C Dioxygenases and Methods for Using the Same

A variant methylcytosine dioxygenase is also provided. In some embodiments, the methylcytosine dioxygenase comprises an amino acid sequence that is at least 90% identical to (e.g., at least 92%, at least 94%, at least 96%, at least 97%, at least 98%, or at least 99% identical to) the amino acid sequence of TETv (SEQ ID NO:1); and contains the amino acid sequence of SEQ ID NO:2. As would be apparent, this polypeptide has ^(m)C dioxygenase activity. The TETv sequence is shown below:

TETv (SEQ ID NO: 1) GGSQSQNGKCEGCNPDKDEAPYYTHLGAGPDVAAIRTLMEERYGEKGKAIRIEKVIYTGKEGKSSQGCPIAKWVYRRSSEEEKLLCLVRVRPNHTCETAVMVIAIMLWDGIPKLLASELYSELTDILGKCGICTNRRCSQNETRNCCCQGENPETCGASFSFGCSWSMYYNGCKFARSKKPRKFRLHGAEPKEEERLGSHLQNLATVIAPIYKKLAPDAYNNQVEFEHQAPDCCLGLKEGRPFSGVTACLDFSAHSHRDQQNMPNGSTVVVTLNREDNREVGAKPEDEQFHVLPMYIIAPEDEFGSTEGQEKKIRMGSIEVLQSFRRRRVIRIG

DAA AVQEIEYWSDSEHNFQDPCIGGVAIAPTHGSILIECAKCEVHATTKVNDPDRNHPTRISLVLYRHKNLFLPKHCLALWEAKMAEKARKEEECGKNGSDHVSQKNHGKQEKREPTGPQEPSYLRFIQSLAENTGSVTTDSTVTTSPYAFTQ VTGPYNTFV

TETv is derived from mouse Tet2 catalytic domain and contains a deletion. The amino acid sequence ELPKSCEVSGQ (SEQ ID NO:2) is italicized within the sequence of TETv and TETcd sequences shown above and below.

TETcd (TET-2 catalytic domain) (SEQ ID. NO. 3)QSQNGKCEGCNPDKDEAPYYTHLGAGPDVAAIRTLMEERYGEKGKAIRIEKVIYTGKEGKSSQGCPIAKWVYRRSSEEEKLLCLVRVRPNHTCETAVMVIAIMLWDGIPKLLASELYSELTDILGKCGICTNRRCSQNETRNCCCQGENPETCGASFSFGCSWSMYYNGCKFARSKKPRKFRLHGAEPKEEERLGSHLQNLATVIAPIYKKLAPDAYNNQVEFEHQAPDCCLGLKEGRPFSGVTACLDFSAHSHRDQQNMPNGSTVVVTLNREDNREVGAKPEDEQFHVLPMYIIAPEDEFGSTEGQEKKIRMGSIEVLQSFRRRRVIRIGELPKSC KKKAEPKKAKTKKAARKRSSLENCSSRTEKGKSSSHTKLMENASHMKQMTAQPQLSGPVIRQPPTLQRHLQQGORPQQPQPPQPQPQTTPQPQPQPQHIMPGNSQSVGSHCSGSTSVYTRQPTPHSPYPSSAHTSDIYGDTNHVNFYPTSSHASGSYLNPSNYMNPYLGLLNQNNQYAPFPYNGSVPVDNGSPFLGSYSPQAQSRDLHRYPNQDHLTNQNLPPIHTLHQQTFGDSPSKYLSYGNQNMQRDAFTTNSTLKPNVHHLATFSPYPTPKMDSHFMGAASRSPYSHPHTDYKTSEHHLPSHTIYSYTAAASGSSSSHAFHNKENDNIANGLSRVLPGFNHDRTASAQELLYSLTGSSQ EKQPEVSGQDAAAVQEIEYWSDSEHNFQDPCIGGVAIAPTHGSILIECAKCEVHATTKVNDPDRNHPTRISLVLYRHKNLFLPKHCLALWEAKMAEKARKEEECGKNGSDHVSQKNHGKQEKREPTGPQEPSYLRFIQSLAENTGSVTTD STVTTSPYAFTQVTGPYNTFV

The deleted amino acids correspond to residues 338 to 704 of TETcd (shown in italics above). The amino acid sequence ELPKSCEVSGQ (SEQ ID NO:2) contains 5 amino acids from one side of the junction and 5 amino acids from the other side of the junction, as shown above.
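The relationship between a parent sequence, an internal deletion, and the junction sequence that spans it can be illustrated with a short Python sketch. The parent string used here is a toy sequence made of the twenty standard amino acid letters, not TETcd, and the helper names `delete_residues` and `junction` are assumptions introduced for illustration.

```python
def delete_residues(seq, first, last):
    """Remove residues first..last (1-based, inclusive) from seq."""
    return seq[: first - 1] + seq[last:]


def junction(seq, first, last, flank=5):
    """Return the residues flanking a first..last deletion, joined together.

    'flank' residues are taken from each side of the deleted block.
    """
    left = seq[max(0, first - 1 - flank): first - 1]
    right = seq[last: last + flank]
    return left + right


# Toy parent sequence (hypothetical), deleting residues 8 to 12.
parent = "ACDEFGHIKLMNPQRSTVWY"
variant = delete_residues(parent, 8, 12)
junction_seq = junction(parent, 8, 12, flank=3)
```

A junction sequence of this kind is present in the deletion variant but absent from the parent, which is why it can serve as a signature for the variant.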

In some embodiments, the variant methylcytosine dioxygenase may be a fusion protein. In these embodiments, the variant may have a binding domain that is capable of recognizing and binding to another molecule. Thus, in some embodiments the binding domain is a histidine tag (“His-tag”) although a maltose-binding protein, a chitin-binding domain, a SNAP-Tag® or a DNA-binding domain, which may include a zinc finger and/or a transcription activator-like (TAL) effector domain are also examples of binding moieties.

Embodiments include a buffered composition containing a purified TETv. For example, the pH of the buffer in the composition is pH 5.5-8.5, for example pH 5.5-7.5, pH 7.5-8.0 or pH 8.0. In various embodiments, the buffered composition may contain glycerol and/or Fe(II), as cofactor, and α-ketoglutarate, as co-substrate, for the enzyme. In some of these embodiments, the composition contains ATP to allow further oxidation of ^(hm)C to ^(f)C and ^(ca)C; in other embodiments, the composition does not contain dATP, which limits the distribution of the oxidized forms of ^(m)C.

Embodiments include an in vitro mixture that includes a TETv, a βGT, a cytidine deaminase, and/or an endonuclease. The in vitro mixture may further include a polynucleotide substrate and at least dATP. The polynucleotide could be ss or ds, a DNA or RNA, a synthesized oligonucleotide (oligo), chromosomal DNA, or an RNA transcript. The polynucleotide used could be labeled at one or both ends. The polynucleotide may harbor a C, ^(m)C, ^(hm)C, ^(f)C, ^(ca)C or ^(ghm)C. In other embodiments, the polynucleotide may harbor a T, U, hydroxymethyluracil (^(hm)U), formyluracil (^(f)U), or carboxyuracil (^(ca)U).

Embodiments provide a TETv, which oxidizes ^(m)C to ^(hm)C, ^(f)C, and/or ^(ca)C preferably in any sequence context with minimal sequence bias and minimal damage to the DNA substrate compared to BS-seq. TETv may additionally or alternatively oxidize T to ^(hm)U or ^(f)U with improved efficiency and reduced bias compared with naturally occurring mouse TET-2 enzyme, or its catalytic domain (TETcd).

In an embodiment of the method, C could be distinguished from ^(m)C by reacting the polynucleotide of interest with a TETv and a cytidine deaminase wherein only C is converted to U. A further embodiment includes sequencing the polynucleotide treated with the βGT and the cytidine deaminase, in which C is converted to U and ^(m)C is converted to a T, and comparing the sequencing results to that of sequencing the untreated polynucleotide to map ^(m)C and ^(hm)C locations in the polynucleotide.

In another embodiment of the method, both ^(m)C and ^(hm)C locations in a polynucleotide are mapped. In this method, the polynucleotide is: (a) untreated; (b) reacted with bisulfite reagent; or (c) reacted with GT prior to adding a methylcytosine dioxygenase and then treated with bisulfite reagent. The products of (a) through (c) are sequenced, and comparison of the sequencing results enables the mapping of ^(m)C and ^(hm)C and their differentiation from C: (a) C, ^(m)C, and ^(hm)C are all sequenced as C; (b) C is sequenced as C while ^(m)C and ^(hm)C are sequenced as T; and (c) ^(hm)C is converted to ^(ghm)C and sequenced as C, C is sequenced as C, and ^(m)C is sequenced as T.

In some embodiments, ^(m)C locations in a polynucleotide are mapped by coupling the oxidation activity of TETv to the activity of a restriction endonuclease or an AP endonuclease specific to ^(hm)C or ^(f)C/^(ca)C, respectively.

In some aspects, ^(m)C, ^(hm)C, or ^(f)C may be mapped to sites in a polynucleotide using single-molecule sequencing technologies such as Single Molecule Real-Time (SMRT) Sequencing, Oxford Nanopore Single Molecule Sequencing (Oxford, UK) or 10× Genomics (Pleasanton, Calif.). In some embodiments, the method may employ TETv, a cytidine deaminase, and/or GT.

The above-described TETv enzyme can be used as a methylcytosine dioxygenase in any of the methods, compositions or kits summarized above and described in greater detail below.

Kits

Also provided by the present disclosure are kits for practicing the subject method as described above. In certain embodiments, a subject kit may contain: a GT, a methylcytosine dioxygenase and a cytidine deaminase. The components of the kit may be combined in one container, or each component may be in its own container. For example, the components of the kit may be combined in a single reaction tube or in one or more different reaction tubes. Further details of the components of this kit are described above. The kit may also contain other reagents described above and below that may be employed in the method, e.g., a buffer, ADP-glucose, plasmids into which NAs can be cloned, controls, amplification primers, etc., depending on how the method is going to be implemented.

In addition to the above-mentioned components, the subject kit may further include instructions for using the components of the kit to practice the subject method. The instructions for practicing the subject method are generally recorded on a suitable recording medium. For example, the instructions may be printed on a substrate, such as paper or plastic, etc. As such, the instructions may be present in the kits as a package insert, in the labeling of the container of the kit or components thereof (i.e., associated with the packaging or subpackaging) etc. In other embodiments, the instructions are present as an electronic storage data file present on a suitable computer readable storage medium, e.g., CD-ROM, diskette, etc. In yet other embodiments, the actual instructions are not present in the kit, but means for obtaining the instructions from a remote source, e.g., via the internet, are provided. An example of this embodiment is a kit that includes a web address where the instructions can be viewed and/or from which the instructions can be downloaded. As with the instructions, this means for obtaining the instructions is recorded on a suitable substrate.

Utility

In some embodiments, the method can be used to compare two samples. In these embodiments, the method may be used to identify a difference in the pattern of C modification in a test NA fragment relative to the pattern of cytosine modification in a corresponding reference NA. This method may comprise (a) determining the location of all modified C in a test NA fragment using the above-described method to obtain a first pattern of C modification; (b) determining the location of all modified C in a reference NA fragment using the above-described method to obtain a second pattern of C modification; (c) comparing the test and reference patterns of C modification; and (d) identifying a difference in the pattern of cytosine modification, e.g., a change in the amount of ^(m)C or ^(hm)C, in the test NA fragment relative to the reference NA fragment.
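Steps (c) and (d) can be sketched as a position-by-position comparison of two call sets. The representation of a pattern as a dictionary from position to a label such as `"C"`, `"mC"` or `"hmC"`, and the function name `compare_patterns`, are assumptions made here for illustration; the sketch does not address coverage, replicates, or statistical significance.

```python
def compare_patterns(test, reference):
    """Return positions whose call differs between test and reference.

    test, reference: {position: call}, e.g. {12: "mC", 40: "C"}.
    The result maps position -> (reference call, test call); a call of
    None means the position was not called in that sample.
    """
    positions = sorted(set(test) | set(reference))
    return {
        pos: (reference.get(pos), test.get(pos))
        for pos in positions
        if test.get(pos) != reference.get(pos)
    }


# Hypothetical patterns of C modification for two samples.
reference_pattern = {12: "mC", 40: "C", 75: "hmC"}
test_pattern = {12: "C", 40: "C", 75: "hmC", 90: "mC"}
differences = compare_patterns(test_pattern, reference_pattern)
```

In this toy case the output reports a loss of ^(m)C at position 12 and a position called only in the test sample, which is the kind of difference step (d) is intended to identify.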

In some embodiments, the test NA and the reference NA are collected from the same individual at different times. In other embodiments, the test NA and the reference NA are collected from different tissues or different individuals.

Exemplary NAs that can be used in the method include, for example, NA isolated from cells isolated from a tissue biopsy (e.g., from a tissue having a disease such as colon, breast, prostate, lung, skin cancer, or infected with a pathogen etc.) and NA isolated from normal cells from the same tissue, e.g., from the same patient; NA isolated from cells grown in tissue culture that are immortal (e.g., cells with a proliferative mutation or an immortalizing transgene), infected with a pathogen, or treated (e.g., with environmental or chemical agents such as peptides, hormones, altered temperature, growth condition, physical stress, cellular transformation, etc.), and NA isolated from normal cells (e.g., cells that are otherwise identical to the experimental cells except that they are not immortalized, infected, or treated, etc.); NA isolated from cells isolated from a mammal with a cancer, a disease, a geriatric mammal, or a mammal exposed to a condition, and NA isolated from cells from a mammal of the same species, e.g., from the same family, that is healthy or young; and NA isolated from differentiated cells and NA isolated from non-differentiated cells from the same mammal (e.g., one cell being the progenitor of the other in a mammal, for example). In one embodiment, NA isolated from cells of different types, e.g., neuronal and non-neuronal cells, or cells of different status (e.g., before and after a stimulus on the cells) may be compared. In another embodiment, the experimental material is NA isolated from cells susceptible to infection by a pathogen such as a virus, e.g., human immunodeficiency virus (HIV), etc., and the reference material is NA isolated from cells resistant to infection by the pathogen. In another embodiment of the invention, the sample pair is represented by NA isolated from undifferentiated cells, e.g., stem cells, and NA isolated from differentiated cells.

In some exemplary embodiments, the method may be used to identify the effect of a test agent, e.g., a drug, or to determine if there are differences in the effect of two or more different test agents. In these embodiments, NA from two or more identical populations of cells may be prepared and, depending on how the experiment is to be performed, one or more of the populations of cells may be incubated with the test agent for a defined period of time. After incubation with the test agent, the genomic DNA from one or both of the populations of cells can be analyzed using the methods set forth above, and the results can be compared. In a particular embodiment, the cells may be blood cells, and the cells can be incubated with the test agent ex vivo. These methods can be used to determine the mode of action of a test agent, or to identify changes in chromatin structure or transcription factor occupancy in response to the drug, for example.

The method described above may also be used as a diagnostic (which term is intended to include methods that provide a diagnosis as well as methods that provide a prognosis). These methods may comprise, e.g., analyzing C modification from a patient using the method described above to produce a map; and providing a diagnosis or prognosis based on the map.

The method set forth herein may also be used to provide a reliable diagnostic for any condition associated with altered cytosine modification. The method can be applied to the characterization, classification, differentiation, grading, staging, diagnosis, or prognosis of a condition characterized by an epigenetic pattern. For example, the method can be used to determine whether the C modifications in a fragment from an individual suspected of being affected by a disease or condition are the same as or different from those in a sample that is considered “normal” with respect to the disease or condition. In particular embodiments, the method can be directed to diagnosing an individual with a condition that is characterized by an epigenetic pattern at a particular locus in a test sample, where the pattern is correlated with the condition. The methods can also be used for predicting the susceptibility of an individual to a condition.
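The comparison of a test fragment against a “normal” sample can be sketched in code. The following is only a minimal illustration under stated assumptions: each pattern is assumed to be a mapping from a cytosine position to a modification call, and the function name, positions and calls are hypothetical rather than taken from the Examples.

```python
def differing_sites(test_pattern, reference_pattern):
    """Return the sorted positions whose modification call differs.

    Both arguments map a cytosine position (int) to a call such as
    "C", "mC" or "hmC". Only positions called in both samples are
    compared; positions covered in one sample only are ignored.
    """
    shared = test_pattern.keys() & reference_pattern.keys()
    return sorted(p for p in shared if test_pattern[p] != reference_pattern[p])


# Hypothetical calls for one fragment from a test and a "normal" sample.
test_sample = {10: "mC", 42: "C", 77: "hmC", 90: "mC"}
normal_sample = {10: "mC", 42: "mC", 77: "C", 95: "C"}

changed = differing_sites(test_sample, normal_sample)
```

Restricting the comparison to positions called in both samples is a design choice of this sketch; it avoids reporting a difference where one sample simply lacks coverage.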

In some embodiments, the method can provide a prognosis, e.g., to determine if a patient is at risk for recurrence. Cancer recurrence is a concern relating to a variety of types of cancer. The prognostic method can be used to identify surgically treated patients likely to experience cancer recurrence so that they can be offered additional therapeutic options, including preoperative or postoperative adjuncts such as chemotherapy, radiation, biological modifiers and other suitable therapies. The methods are especially effective for determining the risk of metastasis in patients who demonstrate no measurable metastasis at the time of examination or surgery.

The method can also be used to determine a proper course of treatment for a patient having a disease or condition, e.g., a patient that has cancer. A course of treatment refers to the therapeutic measures taken for a patient after diagnosis or after treatment. For example, a determination of the likelihood for recurrence, spread, or patient survival can assist in determining whether a more conservative or more radical approach to therapy should be taken, or whether treatment modalities should be combined. For example, when cancer recurrence is likely, it can be advantageous to precede or follow surgical treatment with chemotherapy, radiation, immunotherapy, biological modifier therapy, gene therapy, vaccines, and the like, or adjust the span of time during which the patient is treated.

In a particular embodiment, a lab will receive a sample (e.g., blood) from a remote location (e.g., a physician's office or hospital), the lab will analyze a NA isolated from the sample as described above to produce data, and the data may be forwarded to the remote location for analysis.

Epigenetic regulation of gene expression may involve cis or trans-acting factors including nucleotide methylation. While cis-acting methylated nucleotides are remotely positioned in a DNA sequence corresponding to an enhancer, these sites may become adjacent to a promoter in a three-dimensional structure for activating or deactivating expression of a gene. Enhancers can be megabases away from the corresponding promoter and thus understanding the relationship between a methylation site in an enhancer and its impact on a corresponding promoter (phasing) over long distances is desirable. Phasing the methylation of a distantly located enhancer to a promoter on which it acts can provide important insights into gene regulation and mis-regulation that occurs in diseases such as cancer.

In order to further illustrate the present invention, the following specific examples are given with the understanding that they are being offered to illustrate the present invention and should not be construed in any way as limiting its scope.

All references cited herein are incorporated by reference.

EXAMPLES

Example 1. Enzyme Based Method for Mapping Methylcytosine and Hydroxymethylcytosine

Embodiments of methods described herein provide an unbiased, efficient means of mapping ^(m)C and ^(hm)C along long stretches of genomic DNA. Such methods describe how to protect biologically relevant DNA modifications, such as ^(m)C and ^(hm)C, in a DNA deamination reaction in order to detect and read these modifications. The methods avoid the unwanted fragmentation that arises using chemical methods (such as the bisulfite method). The enzymatic methods use one or more of the following enzymes: a cytidine deaminase, a methylcytosine dioxygenase and a GT.

Examples are provided that utilize a cytidine deaminase described in U.S. Pat. No. 9,121,061 (specifically APOBEC3A in this example), although other cytidine deaminases may be used (as discussed above). The Examples provided herein utilize Deaminase-seq. Deaminase-seq refers to the pathway that depends on a deaminase reaction leading to sequencing to detect modified cytosine. The pathway shown in FIG. 1A may further include a GT (such as βGT), which may be combined with the methylcytosine dioxygenase in one reaction mix or added sequentially in one reaction vessel. A novel methylcytosine dioxygenase is described herein that provides more efficient and unbiased conversion of ^(m)C and ^(hm)C to ^(ca)C than do wild type human or mouse TET proteins. Typically, Deaminase-seq includes the following steps: treating genomic DNA or DNA library preparations (such as Ultra II Library prep with protected adaptors (NEB)) with one or more of TET2 dioxygenase and GT enzymes, for example, TET2 dioxygenase followed by GT (βGT) or in parallel with GT; removal of enzyme activity by, for example, heat denaturation; deamination using, for example, APOBEC3A; amplification; and then sequencing in an Illumina sequencer, PacBio sequencer or other commercially available sequencing device. Further experimental details for embodiments are provided below.

A. Discrimination of Methylcytosine from Unmodified Cytosine in Genomic DNA Using an Engineered Methylcytosine Dioxygenase (TETv) and a Cytidine Deaminase (APOBEC)

(i) Mouse NIH/3T3 DNA (250 ng) was reacted with TETv (8 μM) in 50 μl Tris buffer at 37° C. for 1 hour and the oxidized DNA was column purified (Zymo Research, Irvine, Calif.).

(ii) The DNA was then heated to 70° C. in the presence of 66% formamide in a thermocycler and then placed on ice. RNase A (0.2 mg/ml), BSA (10 mg/ml) and cytidine deaminase (0.3 mg/ml) were added (see also Bransteitter et al. PNAS (2003) vol 100, 4102-4107) and incubated for 3 hours at 37° C. DNA was column purified (Zymo Research, Irvine, Calif.). Following PCR with U-bypass DNA polymerase (New England Biolabs, Ipswich, Mass.) using Primer 1 AATGAAGGAAATGAATTTGGTAGAG (SEQ ID NO:6) and Primer 2 TCCCAAATACATAAATCCACACTTA (SEQ ID NO:7), the products were cloned using the NEB PCR Cloning Kit (New England Biolabs, Ipswich, Mass.) and the clones were subjected to Sanger sequencing. Sequencing results are summarized in FIG. 2A. Empty dots represent unmodified CpG sites in the PCR fragment; black dots represent ^(m)CpG sites in the PCR fragment.

B. Discrimination of Hydroxymethylcytosine from Unmodified Cytosine and Methylcytosine Using T4-βGT (New England Biolabs, Ipswich, Mass.) and Cytidine Deaminase

(i) DNA was reacted with T4-βGT (20 Units) in the presence of UDP (1 μl) in a volume of 50 μl at 37° C. for 1 hour and the DNA was then column purified. The method then followed the steps in A(ii) above. Sequencing results are summarized in FIG. 2B. Empty dots represent unmodified CpG sites in the PCR fragment; black dots represent ^(hm)CpG sites in the PCR fragment.
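The read-out logic of sections A and B can be sketched in code. This is a minimal illustration and not the analysis used to produce FIGS. 2A and 2B. It assumes, following the two treatments described for FIGS. 1A and 1B in Example 5, that a C retained after TETv plus βGT treatment and deamination marks ^(m)C or ^(hm)C, that a C retained after βGT treatment and deamination marks ^(hm)C only, and that unprotected C is read as T. The function name and the toy sequences are hypothetical, and the reads are assumed to be already aligned to the reference without gaps.

```python
def classify_cytosines(reference, tet_bgt_read, bgt_read):
    """Classify every C of `reference` as "C", "mC" or "hmC".

    tet_bgt_read: read from DNA treated with TETv + bGT, then deaminated
                  (assumed to retain C at mC and hmC positions).
    bgt_read:     read from DNA treated with bGT only, then deaminated
                  (assumed to retain C at hmC positions only).
    """
    if not (len(reference) == len(tet_bgt_read) == len(bgt_read)):
        raise ValueError("sequences must be aligned to the same length")

    calls = {}
    for pos, base in enumerate(reference):
        if base != "C":
            continue
        protected_by_glucosylation = bgt_read[pos] == "C"
        protected_by_oxidation = tet_bgt_read[pos] == "C"
        if protected_by_glucosylation:
            calls[pos] = "hmC"
        elif protected_by_oxidation:
            calls[pos] = "mC"
        else:
            calls[pos] = "C"  # deaminated in both reactions, read as T
    return calls


# Toy example (hypothetical sequences, 0-based positions).
reference    = "ACGTCGACCG"
tet_bgt_read = "ACGTTGATCG"
bgt_read     = "ATGTTGATCG"
calls = classify_cytosines(reference, tet_bgt_read, bgt_read)
```

In practice each position would be called from many reads rather than one, so a real pipeline would count C and T observations per site before assigning a state.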

Example 2. ss DNA is not Damaged During Methylcytosine Dioxygenase, DNA Glucosyltransferase or Cytidine Deaminase Treatment

The demonstration that DNA damage does not occur during the analysis of modified bases in ss DNA is a significant advantage over the current bisulfite method commonly used for methylome analysis (see FIG. 3A-3E). It is the lack of damage, as shown in FIG. 3A-3B and 3D-3E, that makes it possible to obtain phase data.

Mouse E14 genomic DNA was sheared to fragments (Covaris, Woburn, Mass.) of a size of approximately 15 kb, which were selected and purified using AMPure® XP beads (Beckman Coulter, Brea, Calif.). The DNA was then treated as follows:

(a) Control DNA. The 15 kb fragments of DNA were denatured to ss DNA at 70° C. in the presence of 66% formamide for 10 minutes.

(b) Bisulfite converted DNA. The 15 kb fragments of DNA were treated with sodium bisulfite using the EZ DNA Methylation-Gold™ Kit (Zymo Research, Irvine, Calif.), according to the instruction manual.

(c) T4-βGT and cytidine deaminase (APOBEC3A) treated DNA. 15 kb DNA fragments were glucosylated and then deaminated as described in Example 1.

(d) TETv and cytidine deaminase (APOBEC3A) treated DNA. 15 kb DNA fragments were treated with TETv, and then deaminated as described above.

Initially the DNA from samples (a)-(d) was examined on an Agilent RNA 6000 pico chip (Agilent, Santa Clara, Calif.). The data is given in FIG. 3E (the y-axis is fluorescence units while the x-axis is size (daltons)). The light blue line represents the denatured ss DNA of the 15 kb AMPure size selected fragments, which is also the control. The red line is APOBEC deamination on glucosylated DNA. The dark blue line is DNA deamination on TETv oxidized DNA. The green line is bisulfite treated DNA. When compared to the control, both cytidine deaminase treated substrates show no significant difference in size distribution, whereas the bisulfite treated DNA was greatly reduced in size, showing significant DNA degradation.

The 15 kb treated DNA from samples (a)-(d) was also PCR amplified to produce amplicons of 4229 bp, 3325 bp, 2018 bp, 1456 bp, 731 bp and 388 bp using Phusion® U (ThermoFisher Scientific, Waltham, Mass.) DNA polymerase.

Products were analyzed on 1% agarose gels and the results are provided in FIG. 3B-3E. The results show that the treatment of DNA with cytidine deaminase, GT and the methylcytosine dioxygenase did not cause detectable fragmentation. In contrast, bisulfite treatment broke the DNA into fragments no larger than 731 bp.

Amplicon (bp)   Primers
388             (SEQ ID NO: 8)  TAGGATAAAAATATAAATGTATTGTGGGATGAGG
                (SEQ ID NO: 9)  AAAACATATAACCCCCTCCACTAATAC
731             (SEQ ID NO: 10) AGATATATTGGAGAAGTTTTGGATGATTTGG
                (SEQ ID NO: 11) AAAACATATAACCCCCTCCACTAATAC
1456            (SEQ ID NO: 12) TAAGATTAAGGTAGGTTGGATTTGG
                (SEQ ID NO: 13) TCATTACTCCCTCTCCAAAAATTAC
2018            (SEQ ID NO: 14) AAGATTTAAGGGAAGGTTGAATAGG
                (SEQ ID NO: 15) ACCTACAAAACCTTACAAACATAAC
3325            (SEQ ID NO: 16) TGGAGTTTGTTGGGGGGTTTGTTGTTTAAG
                (SEQ ID NO: 17) TCTAACCCTCACCACCTTCCTAATACCCAA
4229            (SEQ ID NO: 18) TGGTAAAGGTTAAGAAGGGAAGATTGTGGA
                (SEQ ID NO: 19) AACCCTACTTCCCCCTAACAAATTTTCAAC

Example 3. Synthesis of an Adaptor for NGS Library Construction where all Cytosines are Protected from Deamination in the Presence of Cytidine DNA Deaminases

This example describes an experiment confirming that pyrrolo-dC is not a substrate for cytidine deaminase and may be used to synthesize a protected adaptor suitable for a sequencing platform such as Illumina.

A reaction mixture was made containing 2 μM 44 bp ssDNA oligonucleotide containing a single Pyrrolo-dC (5′-ATAAGAATAGAATGAATXGTGAAATGAATATGAAATGAATAGTA-3′, X=Pyrrolo-dC) (SEQ ID NO:4), 50 mM BIS-TRIS pH 6.0, 0.1% Triton X-100, 10 μg BSA, 0.2 μg RNase A, and 0.2 μM purified recombinant cytidine deaminase. This was incubated at 37° C. for 16 hours. The DNA was recovered using the DNA Clean and Concentrator™ Kit (Zymo Research, Irvine, Calif.). A mixture of nuclease P1, Antarctic phosphatase and DNase I was used to digest the purified ss DNA substrate to nucleosides. LC-MS was performed on an Agilent 1200 series (G1315D Diode Array Detector, 6120 Mass Detector) (Agilent, Santa Clara, Calif.) with a Waters Atlantis T3 (4.6×150 mm, 3 μm, Waters, Milford, Mass.) column with in-line filter and guard column. The results are shown in FIGS. 4A and 4B. Expected peaks were observed in each sample, and no changes were detected after the treatment with cytidine deaminase (MS: m/z=265). The modified adaptor for NGS library construction was synthesized as a 65-mer ss DNA using standard phosphoramidite chemistry (Glen Research, Sterling, Va.) on an ABI394 Synthesizer (Applied Biosystems, Foster City, Calif.). Pyrrole phosphoramidite and purification columns were purchased from Glen Research, Sterling, Va. The oligonucleotide was deprotected according to the manufacturer's recommendations, purified using Glen-Pak DMT-ON columns, and desalted using Gel-Pak size-exclusion columns.

An example of a Pyrrolo dC adaptor sequence is provided below, where X=Pyrrolo-dC:

(SEQ ID NO: 5) 5′/5Phos/GATXGGAAGAGXAXAXGTXTGAAXTXXAGTX/deoxy/U/AXAXTXTTTXXXTAXAXGAXGXTXTTXXGATCT (also see FIGS. 4A and 4B).

Example 4. Whole Genome Methylome Analysis

To explore whether any sequence bias occurred and also the efficiency of the methodology, mouse ES cell genomic DNA was sheared to 300 bp fragments with a Covaris S2 sonicator (Covaris) for library preparation with the NEBNext® Ultra™ DNA Library Prep Kit for Illumina® according to the manufacturer's instructions for DNA end repair, methylated adapter ligation, and size selection. The sample was then denatured by heat. A Pyrrolo-dC NEBNext adaptor (New England Biolabs, Ipswich, Mass.) was ligated to the dA-tailed DNA followed by treatment with NEB USER™ (New England Biolabs, Ipswich, Mass.).

Adaptor Ligation Reaction

Component                                 μl
dA-tailed DNA                             65
Pyrrolo dC NEBNext adaptor (5 μM)          2
Blunt/TA Ligase Master Mix                15
Ligation Enhancer                          1
Total volume                              83

Three libraries were created. A first library was sodium bisulfite treated with the EZ DNA Methylation-Gold Kit. A second library was treated with the EpiTect® Bisulfite Kit Cat. No. 59104 (Qiagen, Valencia, Calif.) according to the instruction manual. A third library was treated according to Example 1. The libraries were PCR amplified using NEBNext Q5® Uracil PCR Master Mix, NEBNext Universal PCR Primer for Illumina (15 μM) and NEBNext Index PCR Primer for Illumina (15 μM) (all commercially available from New England Biolabs, Ipswich, Mass.).

TABLE 1 Suggested PCR cycle numbers for mouse ES cell genomic DNA.

DNA input     Number of PCR cycles
1 μg          4~7
100 ng        8~10
50 ng         9~11

The results are shown in FIGS. 5-9.

Deaminase-seq did not display strong sequence preference, whereas both BS-seq methods produced more non-conversion errors (FIG. 5). Moreover, Deaminase-seq provided results that accurately reflected the number of C in a DNA regardless of the nature of the adjacent nucleotide, in contrast to BS-seq, which showed significant biases for CA (FIG. 6A-6D). With the same normalized library size of 336 million reads, the Deaminase-seq library covered 1.5 million more CpG dinucleotide sites than either BS-seq library and in total had coverage for 38.0 million single CpG dinucleotides, i.e., 89% of those in the entire mouse genome (FIG. 7). Deaminase-seq provides a more even sequencing coverage across the entire genome, with few outliers with very low or very high copy numbers (FIG. 8A-8C). As a result, Deaminase-seq gives nearly 2 times as many reads as BS-seq in the CpG islands (FIG. 9), which are among the most important genomic regions in epigenetic studies.

A 5.4 kb fragment from glucosylated and deaminated mouse embryonic stem cell genomic DNA (chromosome 8) was sheared to 300 bp, and a library of the fragmented DNA was made using the protocol described above and sequenced on an Illumina sequencer. This method accurately identified ^(hm)C at single base resolution across the entire 5.4 kb region (FIG. 10).

Example 5. ^(m)C and ^(hm)C Phasing with SMRT Sequencing (Pacific Biosciences)

Embodiments of the methods described have generated phased genomic maps of epigenetic modifications over regions that are limited only by the DNA polymerase used to amplify the DNA of interest. Should amplification not be utilized, whole genomes could be analyzed using these methods. A typical example is provided herein with results shown in FIGS. 11A and 11B for a genomic region of 5.4 kb.

Mouse brain genomic DNA was treated as described in FIGS. 1A and 1B, namely by reacting aliquots of the DNA with (a) TETv + βGT treatment (for ^(m)C/^(hm)C detection) and (b) βGT treatment (for ^(hm)C detection), respectively. The products of these enzyme reactions were deaminated (cytidine deaminase, e.g., APOBEC3A). A 5.4 kb fragment on chromosome 8 was then amplified from the deaminated DNA by PCR. After purification, the 5.4 kb amplicons were used to construct PacBio SMRT libraries following the “Amplicon template preparation and sequencing” protocol (Pacific Biosciences, Menlo Park, Calif.). One library was prepared for each modification type and was loaded onto a SMRT cell using the MagBead method. The two libraries were sequenced on a PacBio RSII machine. Consensus sequences of individual sequenced molecules (Read of Insert) were generated by the “RS_ReadsOfInsert” protocol using the SMRT portal and were mapped to the mouse reference genome using the Bismark algorithm. The modification states of all the CpG sites across the 5.4 kb were determined for each individual molecule independently. The results show that this 5.4 kb region was heavily methylated across the entire region except for its 5′ end. The molecules can be divided into 2 distinct populations: either hyper-methylated at the 5′ end or methylation depleted at the 5′ end. In comparison, ^(hm)C exists in a few loci and is more dynamic between molecules.
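The division of molecules into two populations by their 5′ end methylation can be sketched as follows. This is a minimal illustration only, not the analysis behind FIGS. 11A and 11B. It assumes that each sequenced molecule has already been reduced to an ordered list of per-CpG calls (1 = modified, 0 = unmodified); the function name, the number of 5′ sites inspected and the 0.5 threshold are hypothetical choices made for the example.

```python
def split_by_five_prime_methylation(molecules, n_five_prime_sites, threshold=0.5):
    """Partition molecules by the methylated fraction of their 5' CpG sites.

    molecules: dict mapping a molecule id to a list of 0/1 calls ordered
               5' to 3' across the CpG sites of the amplicon.
    Returns a dict with the ids of "hyper" and "depleted" molecules.
    """
    groups = {"hyper": [], "depleted": []}
    for mol_id, cpg_calls in molecules.items():
        head = cpg_calls[:n_five_prime_sites]
        if not head:
            raise ValueError(f"molecule {mol_id!r} has no CpG calls")
        fraction = sum(head) / len(head)
        key = "hyper" if fraction >= threshold else "depleted"
        groups[key].append(mol_id)
    return groups


# Toy data: four hypothetical molecules, six CpG sites each.
molecules = {
    "m1": [1, 1, 1, 1, 1, 1],
    "m2": [0, 0, 1, 1, 1, 1],
    "m3": [1, 1, 0, 1, 1, 1],
    "m4": [0, 0, 0, 1, 1, 0],
}
groups = split_by_five_prime_methylation(molecules, n_five_prime_sites=3)
```

Because every call in a list comes from the same single molecule, the grouping preserves phase: the state of the 5′ end is tied to the states of the downstream sites on that molecule.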

Example 6. Methylation Phasing of Long DNA Fragments (More than 10 kb Long) Using DD-Seq and Partitioning Technologies Such as 10× Genomics

Long converted ss DNA fragments as described in Example 5 are purified and 1 ng of the DNA is subjected to the 10× Genomics GemCode™ Platform (10× Genomics, Pleasanton, Calif.). DNA is partitioned into droplets together with droplet-based reagents. The reagent contains gel beads with millions of copies of an oligonucleotide and a polymerase that reads through uracil, such as Phusion U. Each oligonucleotide includes the universal Illumina-P5 Adaptor (Illumina, San Diego, Calif.), a barcode, a Read 1 primer site and a semi-random N-mer priming sequence. The partitioning is done in such a way that, statistically, one or several ss converted long DNA fragments are encapsulated with one bead. The beads dissolve after partitioning and release the oligonucleotides. The semi-random N-mer priming sequence anneals randomly on the ss DNA fragment and the polymerase copies the template ss DNA. Droplets are dissolved, the DNA is sheared through physical shearing and, after end repair and dA tailing, the right adaptor is ligated to the ss DNA. Amplification of the library is done using the standard Illumina primers and the library is sequenced using the standard Illumina protocol as well.
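The way barcodes allow short reads to be linked back to the long fragment they were copied from can be sketched as follows. This is a minimal illustration and not the software of the partitioning platform. It assumes that each read has already been reduced to its partition barcode and a mapping of genomic positions to modification calls; the function name, barcodes and positions are hypothetical.

```python
from collections import defaultdict


def group_calls_by_barcode(reads):
    """Merge per-read modification calls into one record per barcode.

    reads: iterable of (barcode, {position: call}) tuples, where call is
           1 for a modified cytosine and 0 for an unmodified one.
    Returns {barcode: {position: call}} with positions sorted.
    """
    merged = defaultdict(dict)
    for barcode, calls in reads:
        # Later reads overwrite earlier ones at the same position; a real
        # pipeline would instead tally calls per position.
        merged[barcode].update(calls)
    return {bc: dict(sorted(calls.items())) for bc, calls in merged.items()}


# Toy reads: two short reads share barcode "AAC" and so are linked.
reads = [
    ("AAC", {100: 1}),
    ("GGT", {150: 0}),
    ("AAC", {5200: 1, 9800: 0}),
]
linked = group_calls_by_barcode(reads)
```

Since one or several fragments may be encapsulated with one bead, calls that share a barcode are candidates for being in phase rather than guaranteed to come from one molecule; fragments from different loci in the same droplet can be separated by their mapped positions.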

Example 7. Activity Comparison of mTET2CD with TETv on Genomic DNA

mTET2CD (3 μM) (SEQ ID NO: 3) or TETv (SEQ ID NO:1) was added to 250 ng IMR90 gDNA (human fetal lung fibroblasts) substrate in a Tris buffer pH 8.0 and the reaction was initiated with the addition of 50 μM FeSO4. The reaction was performed for 1 hour at 37° C. Subsequently, the genomic DNA was degraded to individual nucleotides and analyzed by mass spectrometry.

The results provided in FIGS. 12A and 12B show that in the absence of enzyme, ^(m)C is the predominant modified nucleotide in the DNA with a small amount of ^(hm)C. In the presence of mTET2CD, some but not all ^(m)C was converted to ^(hm)C and a subset of these nucleotides were converted to ^(f)C, suggesting incomplete activity and/or bias. In contrast, TETv converted substantially all the ^(m)C to ^(ca)C with very little intermediate substrate. The results are shown in FIG. 12A.

Example 8: Activity of TETv on ss and ds Mouse Genomic DNA

Mouse 3T3 gDNA was sheared to 1500 bp and purified using a Qiagen nucleotide purification kit (Qiagen, Valencia, Calif.). Fragmented gDNA was denatured to form ss fragments by heating at 95° C. for 5 minutes followed by immediate cool down on ice for 10 minutes. 250 ng sheared 3T3 gDNA substrate was treated with TETv as described in Example 7 under similar reaction conditions. Analysis of modified bases was done according to Example 7. The results are shown in FIG. 12B.

Example 9: TETv Exhibits Very Low Sequence Bias where Analysis of 5 Genomes Shows that the Property is not Substrate Specific

The reaction was performed according to Example 7 using genomic DNA from 5 different cell types. Low sequence specificity is preferable as it denotes lack of sequence bias by the enzyme. The results are shown in FIG. 13.

Example 10: DNA Treated with TETv is Intact

MspI is sensitive to oxidized forms of ^(m)C but not ^(m)C. The reaction was performed according to Example 8. TETv was used at 3 μM and HpaII plasmid substrate at 100 ng. 20 U of BamHI (to linearize the plasmid) and 50 U of MspI in CutSmart® buffer (pH 7.9) (New England Biolabs, Ipswich, Mass.) were added for 1 hour at 37° C. in μL total volume.

The reaction products were resolved on a 1.8% agarose gel. The results are shown in FIG. 14.

It will also be recognized by those skilled in the art that, while the invention has been described above in terms of preferred embodiments, it is not limited thereto. Various features and aspects of the above described invention may be used individually or jointly. Further, although the invention has been described in the context of its implementation in a particular environment, and for particular applications (e.g. epigenetic analysis), those skilled in the art will recognize that its usefulness is not limited thereto and that the present invention can be beneficially utilized in any number of environments and implementations where it is desirable to examine DNA. Accordingly, the claims set forth below should be construed in view of the full breadth and spirit of the invention as disclosed herein.

What is claimed is: 1.-8. (canceled)
 9. A method comprising: a. determining the location of substantially all modified cytosines in a test nucleic acid fragment having a length of at least 2 kb, to obtain a pattern of cytosine modification; b. comparing the pattern of cytosine modification in the test nucleic acid fragment with the pattern of cytosine modification in a reference nucleic acid fragment; and c. identifying a difference in the pattern of cytosine modification in the test nucleic acid fragment in cis relative to the reference nucleic acid fragment.
 10. The method of claim 9, wherein the method comprises comparing the pattern of cytosine modification for the test nucleic acid fragment, wherein the test nucleic acid is linked, in cis, to a gene in a transcriptionally active state to the pattern of cytosine modifications in the same intact nucleic acid fragment that is linked, in cis, to the same gene in a transcriptionally inactive state.
 11. The method of claim 9, wherein transcription of the gene is correlated with a disease or condition.
 12. The method of claim 9, wherein the method comprises comparing the pattern of cytosine modification for a nucleic acid fragment from a patient that has a disease or condition with the pattern of cytosine modification in the same nucleic acid fragment from a patient that does not have the disease or condition.
 13. The method of claim 9, wherein the method comprises comparing the pattern of cytosine modification or lack of modification for a nucleic acid fragment from a patient that is undergoing a treatment with the pattern of cytosine modification or lack of modification in the same intact nucleic acid fragment from a patient that has not been treated with the agent.
 14. A method according to claim 9, wherein the difference in the pattern of cytosine modification in the test nucleic acid fragment relative to the reference nucleic acid fragment corresponds to a variant single nucleotide polymorphism, an insertion/deletion or a somatic mutation associated with a pathology.
 15. A method according to claim 9, wherein identifying a difference in the pattern of cytosine modification in the test nucleic acid fragment relative to the reference nucleic acid fragment further comprises identifying a difference in the pattern of unmodified cytosine in cis or a difference in the pattern of modified cytosines in cis.